Collection and curation of prokaryotic genome assemblies from type strains at NCBI

The public sequence databases are entrusted with the dual responsibility of providing an accessible archive to all submitters and supporting data reliability and re-use for all users. Genomes from type materials can act as an unambiguous reference for a taxonomic name and play an important role in comparative genomics, especially for taxon verification or reclassification. The National Center for Biotechnology Information (NCBI) collects and curates information on prokaryotic type strains and genomes from type strains. The average nucleotide identity (ANI)-based quality control processes introduced at NCBI to verify the genomes from type strains and improve related sequence records are detailed here. Using the curated genomes from type strains as reference, the taxonomy of over 1.1 million GenBank genomes was verified, and the taxonomy of over 7000 new submissions (before acceptance to GenBank) and over 1800 existing GenBank genomes was reclassified.


DATA SUMMARY
Detailed descriptions of each file and the file contents are provided in the corresponding README file in the same directory. The following files are part of the public NCBI Taxonomy dump files (https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump): (1) excludedfromtype.dmp: a manually curated list of strains that were listed incorrectly as type strains in the literature or other public resources. (2) typematerial.dmp: a list of type materials along with the corresponding NCBI Taxonomy numeric-identifier (TaxId) and whether the type material is from a heterotypic synonym.
The following files are part of the public NCBI Genomes FTP files (https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS): (1) ANI_report_prokaryotes.txt: this file contains ANI data for all latest prokaryotic genome assemblies in GenBank. (2) prokaryote_type_strain_report.txt: this file provides information about every prokaryote in NCBI Taxonomy that has a type strain and/or a genome assembly (of type and/or non-type strains). (3) prokaryote_without_type_assembly.txt: this file provides information about prokaryotes in NCBI Taxonomy that have type strains but do not have assemblies available from type strains. (4) prokaryote_ANI_type_not_matching.txt: this file provides information about assemblies from type strains that appear to be significantly different from the other assemblies available for the same organism. (5) prokaryote_ANI_species_specific_threshold.txt: this file provides species-specific ANI thresholds for species that do not use the default ANI threshold of 96 % to define species boundaries.

INTRODUCTION

Type materials
The public sequence databases are confronted with the dual charge of providing an accessible archive to all submitters and supporting data reliability and re-use for all users. This makes improving the trustworthiness and provenance of public records a long-standing challenge. The databases in the International Nucleotide Sequence Database Collaboration [INSDC, whose members are GenBank, the European Nucleotide Archive (ENA) and the DNA Databank of Japan (DDBJ)] [1] as well as additional resources at the National Center for Biotechnology Information (NCBI) such as RefSeq [2] use a central classification and nomenclature resource, the NCBI Taxonomy [3]. This resource is structured around formalized taxonomic names governed under several codes of nomenclature that extensively lay out how names and species descriptions are published. The codes also define how to document a 'type', an element (usually a specimen or culture) to which the name of a species is permanently attached.
The introduction of type material annotations within NCBI Taxonomy in 2014 marked an important development in how taxonomic assignments can be validated in NCBI databases [4]. This unlocked the ability to implicitly connect species names to physical vouchers and public sequence records. Since a sequence obtained from type material can act as an unambiguous reference for a taxonomic name and any associated species concepts, it becomes possible to compare and adjust names of closely related records, according to their similarity [5].
The term 'type material' as official vocabulary by the INSDC (www.insdc.org/controlled-vocabulary-typematerial-qualifer) includes several variations on 'type' as defined in the relevant codes, with 'type material' collectively referring to all. The International Code of Nomenclature of Prokaryotes (ICNP), governing most bacterial and archaeal names, requires that for each species a designated strain be 'maintained in pure culture' that 'should agree closely to its characters with those in the original description' [6]. This type strain is required to be placed in at least two publicly accessible culture collections in different countries. These strains are frequently swapped with additional repositories, and all subcultures are subsequently referred to as co-identical type strains. Because of all these actions there is a potential for errors and contaminants to be introduced.
The NCBI Taxonomy group and their colleagues maintain and update published information related to type strains and other type material according to the rules of relevant codes of nomenclature [7]. This results in a list of co-identical type strains attached to validly published names along with their relevant publications. For example, Kitasatospora aureofaciens has at least 40 known co-identical type strains listed in NCBI Taxonomy (https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1894) and therefore a comprehensive comparison for any sequence derived from these type strains is required in order to discover any potential errors.
At NCBI, genome assemblies obtained from type strains (referred to as 'type assemblies') are utilized in computational comparisons, e.g., using average nucleotide identity (ANI) to make changes with reasonable confidence, such as reclassifying or modifying existing taxonomy. Like other assemblies, type assemblies are also prone to errors, such as mislabeling, contamination or sequencing issues, but unlike them, these problems will have harmful effects on many other genomes. As a result, particular care is taken when annotating type assemblies. The processes introduced at NCBI to verify the type assemblies and improve related sequence records are detailed here.

Annotating assemblies from type materials
NCBI Taxonomy maintains a manually curated list of validly and effectively published species names as defined by the ICNP [6] and type strains along with the relevant publications. NCBI curators rely on information in publications and important online resources such as the List of Prokaryotic names with Standing in Nomenclature (LPSN) managed by curators at the Leibniz Institute [8]. NCBI Taxonomy also maintains a manually curated list of strains that were listed incorrectly as type strains in the literature or other public resources. The list is available in the FTP file excludedfromtype.dmp, which is part of the public NCBI Taxonomy dump files (https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump), detailed in [3]. The strains in this list and assemblies from these strains were excluded from all our analyses. Type strain names, along with the corresponding NCBI Taxonomy numeric-identifier (TaxId; specifically indicated as NCBI:txid <number>) and whether the type strain is from a heterotypic synonym, were obtained from the FTP file typematerial.dmp, which is also part of the public NCBI Taxonomy dump files. Additional information about the FTP files is available as part of the general taxdump readme file (https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/taxdump_readme.txt).
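As an illustration of how typematerial.dmp might be read, the following Python sketch parses rows in the taxdump layout. It assumes the usual new_taxdump conventions (fields separated by a tab, a vertical bar and a tab, with each row ending in a tab and a vertical bar) and assumes the column order tax_id, tax_name, type, identifier; both should be checked against taxdump_readme.txt. The class and function names, and the sample identifier used in the usage note, are hypothetical.

```python
from typing import Iterable, Iterator, List, NamedTuple


class TypeMaterial(NamedTuple):
    # Assumed column order; verify against taxdump_readme.txt.
    tax_id: int
    tax_name: str
    material_type: str
    identifier: str


def parse_dmp_line(line: str) -> List[str]:
    # Assumed layout: fields separated by TAB | TAB, row terminated by TAB |.
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return [field.strip() for field in line.split("\t|\t")]


def read_typematerial(lines: Iterable[str]) -> Iterator[TypeMaterial]:
    # Yield one record per non-empty row, using the first four fields.
    for line in lines:
        if not line.strip():
            continue
        tax_id, tax_name, material_type, identifier = parse_dmp_line(line)[:4]
        yield TypeMaterial(int(tax_id), tax_name, material_type, identifier)
```

With a local copy of the file, `list(read_typematerial(open("typematerial.dmp")))` would return one record per type-material entry, which can then be grouped by tax_id to list the co-identical type strains of a species.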
Using publicly available search options, a list of GenBank sequence records along with their available metadata, such as the species and the strain name, can be obtained, for example with a query of the form: [...] AND "DSM 12442" [strain]. The GenBank sequences were then mapped to their corresponding NCBI Assembly records.

Assessing average nucleotide identity (ANI) and species-specific thresholds
On the basis of our previous analyses, an ANI threshold of 96 % was used as a default value to define species boundaries [5]. However, there were cases in which a custom species-specific threshold, higher or lower than the default 96%, was required. A higher threshold was used when the species were closely related, and a lower ANI threshold was used for species with broader genomic diversity. Most custom ANI thresholds were automatically determined and then reviewed and approved by the taxonomy curators at NCBI (see below) based on information from publications. In a few exceptional cases, custom ANI thresholds were chosen based on recommendations from external experts familiar with these species (Fig. S1, available in the online version of this article).
Custom species-specific ANI thresholds were determined as follows:
(1) For a given species with at least four assemblies, at least one of which is from the type strain, get the ANI of all assemblies that match the type assembly or assemblies.
(2) Find the minimum ANI among all the matches from the same species (same-species_min_ANI).
(3) Find the maximum ANI among all the matches from different species (cross-species_max_ANI).
(4) If same-species_min_ANI is greater than cross-species_max_ANI and the difference between them is at least 1 %, recommend a threshold that lies 20 % of the difference above cross-species_max_ANI: cross-species_max_ANI + ((same-species_min_ANI - cross-species_max_ANI) * 0.20).
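The four steps above can be sketched in Python as follows. This is a minimal illustration and not the NCBI implementation: the function and parameter names are hypothetical, the input lists are assumed to already contain only the ANI values of matches against the type assembly or assemblies, and returning no recommendation when there are no cross-species matches is an assumption, since the text does not cover that case.

```python
from typing import Iterable, Optional


def recommend_ani_threshold(
    same_species_ani: Iterable[float],
    cross_species_ani: Iterable[float],
    n_assemblies: int,
    has_type_assembly: bool = True,
    min_assemblies: int = 4,
    min_gap: float = 1.0,
    fraction: float = 0.20,
) -> Optional[float]:
    """Return a recommended species-specific ANI threshold, or None."""
    same = list(same_species_ani)
    cross = list(cross_species_ani)
    # Step 1: at least four assemblies, at least one from the type strain.
    if n_assemblies < min_assemblies or not has_type_assembly:
        return None
    # Assumption: no recommendation without both kinds of matches.
    if not same or not cross:
        return None
    same_min = min(same)    # step 2: same-species_min_ANI
    cross_max = max(cross)  # step 3: cross-species_max_ANI
    gap = same_min - cross_max
    # Step 4: the gap must be positive and at least 1 percentage point.
    if gap < min_gap:
        return None
    return cross_max + gap * fraction
```

For instance, with a same-species minimum of 98.5 and a cross-species maximum of 95.0, the gap is 3.5 and the recommended threshold is 95.0 + 0.7 = 95.7. Placing the threshold close to the cross-species maximum keeps same-species assemblies that are more divergent than those sampled so far on the correct side of the boundary.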

Assemblies not used as types
NCBI evaluates all assemblies, including type assemblies, for potential anomalies: contamination, misassembly, taxonomic misidentification, etc. (for a full list, see https://www.ncbi.nlm.nih.gov/assembly/help/anomnotrefseq/). Assemblies that fail one or more tests are considered out-of-scope for inclusion in RefSeq [2]. Assemblies from type material that fail a subset of the tests were excluded and not used as types (see supplementary information for the list of tests). Assemblies from type material that failed some of the tests were still included only if there were no other type assemblies available for the corresponding species (see supplementary material for the list of the tests). However, these type assemblies were not used for making changes such as reclassifying or modifying existing taxonomy of other assemblies. The following query can be used against the NCBI Assembly database (https://www.ncbi.nlm.nih.gov/assembly) to get all type assemblies which belong to this category:

Identifying potentially problematic type assemblies
NCBI uses the ANI method to identify misassigned or contaminated assemblies [5]. In brief, two assemblies are considered matching if their respective coverage is above 80 % and their ANI value is above 96 % (for most species). An assembly that matches a type assembly from a different species is usually considered potentially misidentified. However, for non-type assemblies, if the interspecies matches are from closely related species from the same genus, the submitted species name of the assembly is accepted and not considered misidentified. For type assemblies, stricter criteria were used to ensure that the type assemblies were not misidentified.
A type assembly is considered potentially problematic and flagged for manual review by a curator if any one of the following conditions is true:
(1) It does not match other type assemblies from its own species (missing-type-matches).
(2) It matches a type assembly from the same genus at very high ANI (>98 %) and at least 80 % query and subject coverage (intra-genus-mismatch).
(3) It matches a type assembly from a different genus above the ANI threshold (usually 96 %) and with at least 75 % query and subject coverage (inter-genus-mismatch).
(4) At least 50 % of the non-type assemblies from the same species, for species with four or more non-type assemblies, do not match the type assembly above the ANI threshold and with 75 % query and subject coverage (missing-non-type-matches).
A type assembly is considered to be potentially contaminated with another assembly (or assemblies) if at least either 200 000 bp or 5 % of the query assembly matches a type assembly from a different species with at least 95 % ANI. Assemblies that satisfy these conditions are considered potentially contaminated and marked for manual review.
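The review conditions above can be expressed as a small set of checks. The Python sketch below is illustrative only: the record layout (TypeMatch) and function names are hypothetical, condition (1) is interpreted as "none of the available own-species type assemblies match", intra-genus hits are taken to be from a different species, and the handling of values exactly at a cutoff is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TypeMatch:
    """One ANI comparison of a query type assembly against another type assembly."""
    same_species: bool
    same_genus: bool
    ani: float          # percent
    query_cov: float    # percent
    subject_cov: float  # percent


def _cov_ok(c: TypeMatch, minimum: float) -> bool:
    return c.query_cov >= minimum and c.subject_cov >= minimum


def review_flags(comparisons: List[TypeMatch], non_type_total: int = 0,
                 non_type_matching: int = 0, threshold: float = 96.0) -> List[str]:
    flags = []
    own = [c for c in comparisons if c.same_species]
    other = [c for c in comparisons if not c.same_species]
    # (1) Assumed reading: own-species types exist but none of them match.
    if own and not any(c.ani >= threshold and _cov_ok(c, 80.0) for c in own):
        flags.append("missing-type-matches")
    # (2) Very high ANI to a type assembly of another species in the same genus.
    if any(c.same_genus and c.ani > 98.0 and _cov_ok(c, 80.0) for c in other):
        flags.append("intra-genus-mismatch")
    # (3) Match to a type assembly from a different genus.
    if any(not c.same_genus and c.ani >= threshold and _cov_ok(c, 75.0) for c in other):
        flags.append("inter-genus-mismatch")
    # (4) Half or more of the non-type assemblies (species with >= 4) do not match.
    if non_type_total >= 4:
        unmatched = non_type_total - non_type_matching
        if unmatched / non_type_total >= 0.5:
            flags.append("missing-non-type-matches")
    return flags


def potentially_contaminated(matched_bp: int, query_length_bp: int, ani: float) -> bool:
    # At least 200 000 bp or 5 % of the query matches a different-species type at >= 95 % ANI.
    if ani < 95.0 or query_length_bp <= 0:
        return False
    return matched_bp >= 200_000 or matched_bp / query_length_bp >= 0.05
```

A non-empty list from `review_flags`, or a `True` result from `potentially_contaminated`, would correspond to an assembly being queued for manual review rather than to an automatic decision.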
Following the manual review, a potentially problematic type assembly might be rescued and marked as not problematic. If the type assembly was found to be misassigned or contaminated, it is flagged not to be used as type. This will not only remove a problematic type assembly from the ANI process and the public view, it may also rescue other type assemblies that were considered potentially problematic because of this one. In the NCBI FTP file ANI_report_prokaryotes.txt (https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/ANI_report_prokaryotes.txt), the column 'assembly-type-category' indicates if the type assembly is considered potentially problematic. NCBI does not initiate changing an assembly taxonomy using potentially problematic type assemblies as evidence.
There are at least 2500 species with at least one assembly but none from a type strain. Most of the species without type assemblies are Candidatus species. The top ten species with the highest number of assemblies but without any assembly from type strains are listed in Table 3. Additionally, an up-to-date full list is published as an FTP file (https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/prokaryote_without_type_assembly.txt). The type strains from the species at the top of the list are high priority candidates for sequencing. The species for which type strains are not available (because the strains cannot be cultured, for example) are candidates for manually designated reference type strains.

Assemblies from "type strains" submitted for effectively published taxa and Candidatus species
A taxon that has been described in a journal other than the International Journal of Systematic and Evolutionary Microbiology (IJSEM) is considered to have an effectively published, but not validly published, name and has no standing under the ICNP (Rules 25, 27 and 30 [6]). To be considered for 'validly published' status, and subsequently be treated as a 'formal' name in the NCBI Taxonomy, the effectively published names must be submitted to and included in an IJSEM Validation List. However, a survey in 2018 found that, on average, 150 such effectively published names per year were never submitted for inclusion on the validation lists [9]. Often, the only available data for some sparsely sampled lineages are those under effectively published names.
NCBI Taxonomy also maintains the list of taxa with effectively published names and their 'type strains'. Since effectively published names have no standing in nomenclature, NCBI does not use the 'type assemblies' from these taxa to validate taxonomic assignments through ANI. Currently, there are 1500 such taxa and at least 900 assemblies from these strains. Of these, there are at least 73 species from 71 genera for which no assembly from any type strain is available (Table 4).
Another set of names requires similar attention. Candidatus names are submitted for putative taxa of as yet uncultivated prokaryotes under specific conditions [10]. In several cases 'type strains' were documented, usually resulting in isolate numbers annotated on sequence records. However, these names have no standing under the current ICNP as they are only addressed in an appendix, and these strains are indicated as reference strains only. These assemblies were excluded in this study and no actions or corrections, through ANI or otherwise, are currently made by comparing these assemblies. However, if a broadly accepted standard arises this will be reassessed in future.
Table 5 shows the reason and the count of type assemblies that failed one or more requirements for an assembly to be considered for inclusion in RefSeq. The three most common problems were 'contaminated', i.e., an unintended mixture of two separate species; 'unverified source organism', i.e., the taxonomic assignment of the assembly is misidentified; and 'fragmented assembly', i.e., poor sequence data. The following Entrez query against the NCBI Assembly database (https://www.ncbi.nlm.nih.

Potentially problematic type assemblies
A total of 1033 assemblies were found to be potentially problematic based on the criteria listed in the methods for identifying potentially problematic type assemblies. Table 6 shows the reason and the count of these type assemblies. This list is regarded with more uncertainty than those in Table 5, as there was not enough information to decide whether the assemblies were truly problematic (the suspected type assembly was the only assembly available for its species, for example). A total of 465 type assemblies had either failed to match the other type assemblies from their own species, if available, or matched type assemblies from other species. For example, the type assembly GCA_001890655.1 from Vibrio fluvialis did not match the three type assemblies from its own species and matched all four type assemblies from Vibrio vulnificus (Fig. S2). There were at least 22 pairs of taxa that matched types from a different genus (Table 7). There were 126 taxa with one or more type assemblies for which all type assemblies were found to be problematic, essentially leaving these taxa with no type assemblies after curation.

Genomic coherence of assemblies from type and/or co-identical strains
Assemblies from all co-identical type strains are expected to be identical and ANI can be used for cross verification. At a minimum, these assemblies should reciprocally match each other above the ANI threshold and above 80 % query and subject coverage. Fig. 1 shows the genomic coherence among the type assemblies from the same species. Most of the assemblies from the same species were highly similar (ANI above the threshold) except for the differences in coverage. One or more type assemblies from 18 taxa are significantly different from the rest of the type assemblies from their corresponding species (Table 8).
Fig. 1. Assemblies from type and/or co-identical strains from the same species were mostly similar with few outliers. Genomic coherence or similarity among a pair of type assemblies from the same species was measured using average nucleotide identity (ANI) and symmetric overlap (matched region length over the total length among a pair of assemblies). Assemblies were grouped by their assembly levels (see supplementary information for descriptions of assembly levels) to check if the difference in assembly levels could explain any lack of coherence among type assemblies from the same species. Table 8 lists the extreme outliers.
ANI-based curation helps to identify potentially problematic type strains or assemblies from the type strains. However, this can be influenced by various confounding factors, such as accumulated mutations, mislabeling, contamination or sequencing errors.
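The minimum expectation for co-identical type strains, a reciprocal match above the ANI threshold with at least 80 % coverage in both directions, can be checked pairwise. The following Python sketch assumes a hypothetical in-memory table keyed by (query, subject) that holds the ANI and the query coverage of each directional comparison; the function names and data layout are illustrative and are not taken from the NCBI pipeline.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

# (query, subject) -> (ANI in percent, query coverage in percent); hypothetical layout.
HitTable = Dict[Tuple[str, str], Tuple[float, float]]


def reciprocal_match(hits: HitTable, a: str, b: str,
                     threshold: float = 96.0, min_cov: float = 80.0) -> bool:
    """True if a and b match each other in both directions."""
    for query, subject in ((a, b), (b, a)):
        hit = hits.get((query, subject))
        if hit is None:  # a missing comparison counts as no match
            return False
        ani, cov = hit
        if ani < threshold or cov < min_cov:
            return False
    return True


def incoherent_pairs(hits: HitTable, assemblies: Sequence[str],
                     threshold: float = 96.0,
                     min_cov: float = 80.0) -> List[Tuple[str, str]]:
    """List the pairs of type assemblies from one species that fail the check."""
    return [
        (a, b)
        for a, b in combinations(assemblies, 2)
        if not reciprocal_match(hits, a, b, threshold, min_cov)
    ]
```

Because the coverage of the reverse comparison is the subject coverage of the forward one, checking both directions covers the query and subject coverage requirement. An assembly that appears in most of the failing pairs for its species is a natural candidate for the manual review described earlier.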

Assemblies from type strains are not always representative of their corresponding species, especially for species with broad diversity
Type assemblies are not necessarily obtained from typical representatives of their associated species as noted in ICNP Rule 15. This is especially evident for species with broader genomic diversity. The type strains just happen to be selected from the first isolates analysed for a novel species. Amongst the 1600 species with at least four non-type assemblies that were examined, more than 50 % of the non-type assemblies in 310 species do not match the type assemblies from their corresponding species above the expected ANI threshold of 96 % (missing-non-type-matches; see Methods). A list of species with high intraspecies genomic diversity that were identified in this study can be found here (https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/prokaryote_ANI_type_not_matching.txt). It is possible that some of these assemblies that do not match their type assemblies were misidentified. In other cases, this reflects a more complex genomic diversity (and undescribed, potential taxa) that was corroborated by literature. For example, there are three biotypes (bt1, bt2 and bt3) for Vibrio vulnificus [11], four distinct groups (Group I, II, III and IV) for Clostridium botulinum [12] and two genomospecies groups (GS1 and GS2) for Campylobacter concisus [13]. In these cases, the type strain does not adequately represent all these subspecific genetically divergent groups that are not formally named. For example, only nine of the 253 non-type assemblies match the type assembly in Campylobacter concisus.
Table 8. List of taxa whose assemblies from co-identical strains* differ significantly among themselves (lower ANI values). Columns: Scientific name; Assembly accessions from type strains.
Fig. 2 shows a few examples of broad intraspecies genomic diversity among assemblies as measured by their ANI and coverage of all non-type assemblies against their corresponding type assemblies.
Intraspecies genomic diversity can be addressed by lowering the ANI threshold so that it includes all assemblies from the species as defined, by considering representative assemblies in addition to the type assemblies, or by a combination of both. For species with subspecific clusters or groups, additional representative assemblies can be selected for each cluster or group whilst awaiting additional taxonomic work to label these formally as potentially novel species or subspecies. The Centers for Disease Control and Prevention (CDC), as part of the PulseNet project (https://www.cdc.gov/pulsenet/index.html), uses multiple representative assemblies for many species in order to ensure that this diversity is adequately addressed in computational comparisons. For example, there are at least two clusters for Campylobacter lari and CDC uses one representative (none of which are from the type strain) for each cluster (Fig. S3, personal communication). In this study, as a proof of concept, CDC representative assemblies from two species (Vibrio vulnificus and Listeria monocytogenes) were adopted as extra representatives in addition to type assemblies. Adding these representatives improved the taxon identification for these two species. Fig. 3 shows the best match ANI values of all assemblies from these two species against (1) only their corresponding type assemblies and (2) their type assemblies together with the added representative assemblies. In total, 99 % of the V. vulnificus and 8 % of the L. monocytogenes assemblies that previously did not match their corresponding type assemblies above the expected ANI threshold (96 %) matched their corresponding representative assemblies above that threshold.
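The effect of adding representative assemblies can be illustrated by comparing the best-match ANI of each assembly against the type assemblies alone and against the types plus the extra representatives. The Python sketch below is a simplified illustration under stated assumptions: the nested-dictionary ANI table, the reference labels and the function names are hypothetical, and coverage is ignored for brevity.

```python
from typing import Dict, Iterable, Optional


def best_match_ani(ani_by_reference: Dict[str, float],
                   references: Iterable[str]) -> Optional[float]:
    """Highest ANI of one assembly against the given reference assemblies."""
    values = [ani_by_reference[r] for r in references if r in ani_by_reference]
    return max(values) if values else None


def rescued_fraction(ani_table: Dict[str, Dict[str, float]],
                     type_refs: Iterable[str],
                     extra_refs: Iterable[str],
                     threshold: float = 96.0) -> float:
    """Fraction of assemblies unmatched by the types that match once extras are added."""
    type_refs = list(type_refs)
    all_refs = type_refs + list(extra_refs)

    def matches(row: Dict[str, float], refs) -> bool:
        best = best_match_ani(row, refs)
        return best is not None and best >= threshold

    unmatched = [row for row in ani_table.values() if not matches(row, type_refs)]
    if not unmatched:
        return 0.0
    rescued = sum(1 for row in unmatched if matches(row, all_refs))
    return rescued / len(unmatched)
```

Under this reading, the 99 % and 8 % figures reported for V. vulnificus and L. monocytogenes correspond to the value returned by a calculation like `rescued_fraction` for each species.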

ANI of some type assemblies do not support synonymizing or merging of two taxa
A species can only have one correct name (referred to as the 'current name' in NCBI Taxonomy) and one set of co-identical type strains. When two taxa that were described independently with separate type strain declarations are determined to be the same species, the earlier name will have precedence and the later one would be considered as a later heterotypic synonym of the initially described taxon. The type strains associated with the heterotypic synonym have no standing with regard to the correct species name. Conversely, on the basis of the results of new analysis, existing heterotypic synonyms have been reclassified back as independent species (e.g., Bacillus axarquiensis [14]). ANI can be used to assess the heterotypic synonymy relationship. If the species is correctly defined, assemblies from heterotypic synonyms (referred to as 'syntype assemblies') are expected to be highly similar (i.e., higher ANI and coverage) to the assemblies from type strains of their corresponding species.
There are at least 257 taxa with assemblies from heterotypic synonyms and, of these, at least 220 have at least one assembly available from both the type strains of their corresponding species and the heterotypic synonym. Fig. 4 shows the ANI values of intra-type assemblies (blue open circles) and type vs syntype assemblies (red open circles). There are at least 27 taxa for which the assemblies from heterotypic synonyms do not match the type assemblies of their corresponding species above the default ANI threshold of 96 % (top 27 taxa in Fig. 4). There are 26 taxa for which the assemblies from heterotypic synonyms do match above the ANI threshold, but the ANI value is lower than the ANI of the type vs type matches. Assuming there are no problems with the identification and sequencing of the assemblies considered, the synonymy of these taxa may need to be reconsidered.
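The two situations described above (a syntype assembly below the ANI threshold, or above it but lower than the type vs type ANI) can be separated with a simple comparison. The Python sketch below is illustrative only; the function name and the returned labels are hypothetical and are not categories used by NCBI.

```python
from typing import Optional


def synonymy_status(syntype_vs_type_ani: float,
                    type_vs_type_ani: Optional[float] = None,
                    threshold: float = 96.0) -> str:
    """Classify a heterotypic synonym by the ANI of its assembly against the type assembly."""
    if syntype_vs_type_ani < threshold:
        # The synonym's assembly does not match the type assembly of the species.
        return "below-threshold"
    if type_vs_type_ani is not None and syntype_vs_type_ani < type_vs_type_ani:
        # Matches, but less closely than the type assemblies match each other.
        return "above-threshold-lower-than-type"
    return "consistent"
```

Either of the first two labels would mark a synonymy as worth re-examining, provided the assemblies involved are themselves correctly identified and free of sequencing problems.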

Species-specific ANI threshold
A default ANI threshold of 96 % has been considered to be enough to distinguish between assemblies from two different species [5,15]. However, there have been cases where a threshold higher or lower than the 96 % threshold was required to define species boundaries. Using the automated process described in the methods, it was possible to propose custom ANI thresholds for 67 taxa (Fig. S4). For example, multiple assemblies from Spiroplasma melliferum match the type assembly of Spiroplasma citri above the default 96 % ANI threshold and using a higher ANI threshold of 98.8 % for S. citri separates these two species of the genus Spiroplasma (Fig. S4). Similarly, several assemblies from Mycoplasma mycoides match the type assemblies from their own species below the default 96 % ANI threshold. By lowering the ANI threshold for M. mycoides to 94.4 %, all assemblies from the species match the type assemblies (Fig. S4). A full list of species and their custom ANI thresholds is available as an FTP file (https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/prokaryote_ANI_species_specific_threshold.txt).
Fig. 3. Red circles indicate the best match ANI value of assemblies against only their type assemblies and blue circles indicate the new best match ANI value after including the additional representative assemblies. 8 % of L. monocytogenes and 99 % of V. vulnificus assemblies that previously didn't match their type assemblies matched the newly added representative assemblies at above the expected ANI threshold.
Fig. 4. ANI-based verification of heterotypic synonymization. When two independently described taxa were determined to be the same species, the two taxa would be merged and the taxon that was described later would become the heterotypic synonym of the taxon that was described earlier. Lower ANI values of assemblies from heterotypic synonyms (referred to as 'syntype assemblies' or 'syntype') against the assemblies from type strains (referred to as 'type assemblies' or 'type') from the same species (red circles) indicate potentially problematic synonymizations. There were at least 27 cases where the ANI values of the assemblies from heterotypic synonyms were lower than the ANI threshold of the corresponding species. The dotted vertical line indicates the default ANI threshold of 96 %.

ANI-based validation of assemblies submitted to GenBank using type assemblies
Type assemblies were used as references for validating or reclassifying the taxonomy of all assemblies submitted to GenBank. The taxonomic check status for more than 1.1 million assemblies is summarized in the FTP file (https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/ANI_report_prokaryotes.txt). In the last 5 years, the taxonomy of over 7000 newly sequenced genomes has been reclassified before acceptance to GenBank, as has the taxonomy of 1800 genomes that were already in GenBank. The taxonomy of 137 000 existing GenBank genomes cannot be validated as there are not sufficient data.

Conclusions
The large-scale availability of prokaryotic genomes has already influenced prokaryotic taxonomy in a major way, forcing a shift away from a classification largely based on 16S sequences [16,17] and opening up the possibility of adapting practices [18] in order to sample and reference microbial diversity comprehensively [19]. A stable, genome-based classification requires well validated references. In this paper, we have highlighted several processes introduced, over time, to annotate, validate and correct type genome assemblies, relying on the published literature, authoritative online resources and expert input as far as possible. Several additional processes at NCBI are available to improve data during the submission process, including recent improvements in the Prokaryotic Genome Annotation Pipeline (PGAP [20]) that allow for pre-submission verifications. Nevertheless, improving and accurately filtering out imprecise and inaccurate content remains a daunting task and users should always treat available data with healthy scepticism and perform appropriate verifications where possible [21,22]. The use of type genomes remains central to any of these actions and continued vigilance and adjustments are needed to ensure major errors are kept to a minimum. In some cases, the limitations of type genomes are evident and additional adjustments are needed to address underlying genetic variation. At the same time, prokaryotic taxonomy is undergoing major changes. Specifically, the treatment of uncultured, environmentally sampled genomes, which do not have well fleshed out nomenclature, is posing challenges for data re-use [23]. Proposals to deal with uncultivated taxa range from utilizing a separate code of nomenclature, the SeqCode [24], to extending the use of Candidatus names [25], possibly by utilizing algorithms creating neutral latinized labels that follow current grammatical rules [26].
Some of these proposals have already been debated and not incorporated into the ICNP [27]. Nevertheless, curators and archivists at the public sequence repositories will have to follow any discussion closely and consider adopting and improving ANI and related processes if any broad consensus emerges.
There is a range of resources available online supporting nomenclature and taxonomy of prokaryotes (listed in [16]). As a central depository of public genome sequences that also serves multiple other resources and uses, we believe the most important part of our actions in NCBI Taxonomy and Assembly resources should be devoted to making the taxonomic annotation associated with type assemblies as accurate as possible, while continuing to extend and improve on the processes described here. With the existing curated type assemblies and ANI-based processes we were able to reclassify the taxonomy of over 8800 non-type assemblies during and post submission to GenBank. However, there are still many species without any assemblies from type strains and we have highlighted the high priority candidates for sequencing. We welcome input from the research community to improve our ANI-based curation of public type and non-type assemblies.

Funding Information
This work was supported by the National Center for Biotechnology Information of the National Library of Medicine (NLM), National Institutes of Health (NIH).